Shareware Grab Bag.iso / 001 / pctj0486.rvw | Text File | 1986-04-12 | 7KB | 107 lines
Review of the article "Statistical Correlation", by Thomas Madron in the
April, 1986 issue of PC Tech Journal.
This article could have been a useful addition to the literature on
statistical computing methods for microcomputers and could have provided
readers with a reasonable introduction to multivariate statistical analysis.
However, it is so permeated with incorrect statistical theory and naive
computing methods that readers should be warned not to use either the text or
the program listings for guidance in writing a statistics package. I am not
quibbling over minor discrepancies or over issues that are being honestly
debated in the statistical community. Rather, I am challenging the
author's understanding of some of the fundamental concepts of multivariate
statistical computing.
Specifically, the following are significant misstatements of fact and
erroneous interpretations of statistical methods:
1. The text accompanying Figure 1 states "The correlation coefficient is the
slope of the 'best fit' straight line through these points." In fact, the
correlation coefficient equals the slope of the line times the ratio of the
standard deviations of the two variables.
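The relationship in point 1 is easy to verify numerically. A minimal sketch, using made-up data and the ordinary least-squares slope of y on x:

```python
import math

# Hypothetical data: y is roughly 3*x plus noise, so the least-squares
# slope is near 3 while the correlation must stay within [-1, 1].
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 5.9, 9.2, 11.8, 15.1]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

slope = sxy / sxx                   # least-squares slope of y on x
r = sxy / math.sqrt(sxx * syy)      # Pearson correlation coefficient

sx = math.sqrt(sxx / (n - 1))       # sample standard deviation of x
sy = math.sqrt(syy / (n - 1))       # sample standard deviation of y

# The slope is NOT the correlation; rescaling it by sx/sy recovers r.
assert abs(r - slope * sx / sy) < 1e-12
```

The slope and the correlation coincide only when the two variables have equal standard deviations, which is exactly the detail the article's Figure 1 caption glosses over.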
2. On page 128 is the statement "A coefficient of + or - 1.0 implies a
completely causal relation between two variables ... ." In fact, a unit
correlation only implies that two variables are perfectly associated and says
nothing about causal relationships. This is an extremely important
distinction that students learn in their first class in correlation.
3. The discussion of the consequences of missing data on page 130 is obscure
at best. For example, the statement "A correlation coefficient based on these
two variables can have a somewhat different meaning than if all respondents
had answered both questions" is meaningless, since the correlation is simply
computed on the sample of observations with data present on both variables.
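The pairwise-complete computation just described can be sketched as follows; the survey responses are hypothetical, with None standing in for a missing answer:

```python
import math

def pearson(xs, ys):
    # Plain two-pass Pearson correlation on complete data.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

# Hypothetical responses to two questions; None marks a missing answer.
q1 = [2.0, 4.0, None, 6.0, 8.0, 5.0]
q2 = [1.0, 3.0, 7.0, None, 9.0, 4.0]

# Keep only respondents who answered both questions, then correlate.
pairs = [(a, b) for a, b in zip(q1, q2) if a is not None and b is not None]
xs, ys = zip(*pairs)
r = pearson(list(xs), list(ys))
```

The correlation is simply computed on the four respondents with both answers present; nothing about its meaning changes.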
4. The description of Figure 5 is incorrect and incomplete. The title
"Sample Correlation Matrix" is wrong, since Figure 5 is a contingency table
display of the frequencies of occurrence of the responses to the two
questions. While the rows and columns of Figure 5 are never described, the
text implies that "3" represents missing data and the valid responses are "1"
and "2". In that case, Pearson's product moment correlation is entirely
inappropriate to describe the association between two dichotomous variables,
since it is used to measure the association between continuous variables.
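One conventional measure of association for a 2x2 table of dichotomous responses is the phi coefficient, computed directly from the cell frequencies. A sketch with hypothetical cell counts:

```python
import math

# Hypothetical 2x2 contingency table for two yes/no questions:
#            Q2=yes  Q2=no
# Q1=yes        a      b
# Q1=no         c      d
a, b, c, d = 30, 10, 5, 25

# Phi coefficient: association between two dichotomous variables.
phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
```

Like a correlation, phi lies between -1 and +1, but it is defined for exactly this kind of frequency table.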
5. On page 130 the statement "CORL.FOR is a linear analysis, finding a linear
least-squares fit and performing a linear transformation to normalize data
around 0" is completely incorrect and reveals the author's total ignorance of
the subject, since neither correlation nor linear least-squares normalizes
data about anything. If one wanted to normalize the data around 0, one could
subtract the mean and divide by the standard deviation to transform each
observation to a standard normal deviate.
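That standardizing transformation amounts to one line of code. A sketch with hypothetical observations:

```python
import math

data = [12.0, 15.0, 9.0, 14.0, 10.0]    # hypothetical observations
n = len(data)
mean = sum(data) / n
sd = math.sqrt(sum((v - mean) ** 2 for v in data) / (n - 1))

# Subtract the mean and divide by the standard deviation: each value
# becomes a standard deviate, centered around 0 with unit spread.
z = [(v - mean) / sd for v in data]

assert abs(sum(z)) < 1e-12              # standardized data sums to 0
```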
6. The author has obviously confused the population standard deviation with
the estimate of it based on a sample from that population. The glossary on
page 132 and the program listing on page 140 both indicate that the
denominator of the computed standard deviation is N, when, in fact, the
correct value in this case is N-1. There are some cases when N might be
justified, but the simple linear model analysis problem is not one of them.
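The N versus N-1 distinction is easy to demonstrate. A sketch with hypothetical sample values:

```python
import math

sample = [4.0, 8.0, 6.0, 2.0]           # hypothetical sample from a population
n = len(sample)
mean = sum(sample) / n
ss = sum((v - mean) ** 2 for v in sample)

sd_population = math.sqrt(ss / n)        # appropriate only for a full population
sd_sample = math.sqrt(ss / (n - 1))      # based on the unbiased variance estimate

# Dividing by N systematically understates the spread estimated from a sample.
assert sd_population < sd_sample
```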
7. On page 140 is the comment "Programs that calculate significance tests
usually need an estimate of the number of observations. Subsequent programs
use the LOWEST number of observations taken from the lower diagonal matrix as
a conservative estimate since any significance tests based on a data matrix
with missing data are suspect." Nonsense! In the first place, one does not
estimate the number of observations since one can count them exactly. What
the author probably meant to say was that in making multivariate tests of
hypotheses with missing data some adjustments may be required to the degrees
of freedom for the particular test. However, univariate tests of significance
on individual correlations with different numbers of observations are entirely
appropriate and valid.
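Such a univariate test is routine. A sketch, assuming a hypothetical correlation of 0.60 computed from the 30 respondents who answered both questions in one particular pair of variables:

```python
import math

r, n = 0.60, 30     # hypothetical correlation and its own pairwise count

# Standard t test of the hypothesis that the population correlation is 0,
# using the number of observations actually present for this pair.
df = n - 2
t = r * math.sqrt(df / (1 - r * r))
```

Each correlation in the matrix can be tested with its own n; there is no need to force every test onto the smallest pairwise count.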
In addition, the author's discussion of the computing issues involved in
calculating correlations and standard deviations on microcomputers on pages
128-129 is grossly inadequate. As shown in "Statistical Programs for
Microcomputers", by Peter A. Lachenbruch (Byte Magazine, November, 1983)
arithmetic on sums and sums of squares can be deceptively treacherous. For
example, the author's computational formula on page 129 was originally
developed for mechanical calculators to avoid the need for making two passes
through the data. However, when that formula is blindly applied to data that
is large in magnitude and has little variation, the results can be totally
unpredictable. The problem is compounded by performing the operations on sums
and sums of squares in single precision, as the author has done on page 140.
At the very least, the potentially disastrous results of accumulating
round-off errors can be moderated by performing these operations in double
precision. Also, centering the data by subtracting the means before the
correlations are computed can more than make up for the added execution time
by providing some additional protection against producing meaningless results.
These and other issues related to the potential loss of precision in
statistical microcomputing are discussed in some detail in the excellent
article by Lachenbruch.
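The failure mode of the one-pass formula is easy to reproduce, even in double precision. A sketch with hypothetical data that is large in magnitude and has little variation:

```python
# Hypothetical data: large in magnitude, tiny in variation -- exactly the
# case where the "sums and sums of squares" shortcut breaks down.
data = [1e9, 1e9 + 0.1, 1e9 + 0.2]
n = len(data)

# One-pass formula: SS = sum(x^2) - n*mean^2. The two enormous terms
# cancel catastrophically; in single precision the damage is far worse.
mean = sum(data) / n
sq = sum(v * v for v in data)
one_pass_var = (sq - n * mean * mean) / (n - 1)

# Two-pass method: center the data first, then accumulate the squares.
two_pass_var = sum((v - mean) ** 2 for v in data) / (n - 1)

# The true variance is 0.01; only the centered computation recovers it.
assert abs(two_pass_var - 0.01) < 1e-5
assert abs(one_pass_var - 0.01) > 1e-3
```

Centering costs a second pass through the data, but as the Lachenbruch article discusses, that is a small price for results that are not dominated by round-off error.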
I wasn't going to expend this effort until I saw that the author intends to
publish more articles on test reliability, stepwise multiple
regression, factor analysis, and other multivariate methods of analysis.
Based on the quality of the author's first effort, the potential for disaster
is enormous. I strongly recommend that anyone wanting to do multivariate
statistical analysis on a microcomputer seek guidance from someone
who has demonstrated at least a minimal level of competency in both
statistical methods and statistical computing. In my opinion, this author has
demonstrated neither. I am frankly amazed that the editorial process at PC Tech
Journal is so weak as to allow potentially harmful information like this into
print. PCTJ would do its readers a service by having a competent statistician
review such articles on statistics before they are published.
David N. Iklé, Ph.D.
Biostatistician
Denver, CO